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Arnoldi-based Sampling for High-dimensional 
Optimization using Imperfect Data 


Jason E. Hicken* and Anthony Ashley^ 

Rensselaer Polyteehnie Institute, Troy, New York, 12180 


We present a sampling strategy suitable for optimization problems characterized by 
high-dimensional design spaces and noisy outputs. Such outputs can arise, for example, in 
time-averaged objectives that depend on chaotic states. The proposed sampling method 
is based on a generalization of Arnoldi’s method used in Krylov iterative methods. We 
show that Arnoldi-based sampling can effectively estimate the dominant eigenvalues of 
the underlying Hessian, even in the presence of inaccurate gradients. This spectral in¬ 
formation can be used to build a low-rank approximation of the Hessian in a quadratic 
model of the objective. We also investigate two variants of the linear term in the quadratic 
model: one based on step averaging and one based on directional derivatives. The resulting 
quadratic models are used in a trust-region optimization framework called the Stochastic 
Arnoldi’s Method (SAM). Numerical experiments highlight the potential of SAM rela¬ 
tive to conventional derivative-based and derivative-free methods when the design space is 
high-dimensional and noisy. 


I. Introduction 

Large-scale numerical simulations play a central role in contemporary aircraft design, and, increas¬ 
ingly, this role goes beyond the use of simulations as proxies for experiments. This trend is exemplified 
by differential-equation-constrained optimization (DECO), which couples simulations with numerical opti¬ 
mization. DECO has been used to optimize highly-refined aircraft where small improvements translate into 
significant economic and environmental benefit^ Moreover, DECO has the potential to enable the design 
of unprecedented aircraft configurations where empirical data is sparse and intuition is lacking. 

Clearly, DECO has enormous value and potential for aerospace engineers; however, while DECO algo¬ 
rithms for steady and periodic deterministic systems are maturing, there remains a broad class of problems 
that cannot be optimized with conventional algorithms. These problems exhibit 

1. a high-dimensional design space, 

2. complex physics that must be modeled using large-scale simulations, and 

3. simulation outputs, e.g. lift force or total energy, that are “imperfect”. 

In this context, imperfeet outputs are quantities of interest whose numerical errors cannot be eliminated, 
at least in practice; consequently, such outputs fail to meet the underlying assumptions of conventional 
gradient-based optimization algorithms. The word imperfect is intentionally chosen to distinguish these 
errors from more traditional numerical errors that can, usually, be estimated and effectively reduced. In the 
following sections, we briefly elaborate on two such sources of imperfect data. 

^Assistant Professor, Department of Mechanical, Aerospace, and Nuclear Engineering, Member AIAA 
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^For example, a 1% reduction in the fuel consumed by the worldwide fleet would result in 7 million fewer tonnes of C02 
emitted per year 
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Figure 1. Example trajectories of the Lorenz DE (left) illustrating sensitive dependence on initial conditions. 
A time-averaged objective (right) exhibits high-frequency fluctuations; moreover, only approximate derivatives 
can by computed, in this case using an ensemble adjoint. 


A. Time-averaged Outputs from Chaotic Systems 

The outputs of interest in many engineering systems are time-averages of chaotic solutions. Relevant exam¬ 
ples include the lift and drag on aerodynamic bodies [^, the energy produced by a fusion reactor [^, and 
the (phase-averaged) pressure in an internal-combustion engine [^. 

Chaotic systems are characterized by a sensitive dependence on initial conditions; the distance between 


two states that are “close” initially will exponentially increase with time. This is illustrated in Figure 1(a) 
using the Lorenz DE and two initial conditions that satisfy ||Axo|| < 10“^^. 

In addition to the initial conditions, chaotic systems are sensitive to other parameters. This has signifi¬ 
cant implications for time-averaged outputs, which we illustrate using the Lorenz system and the objective 
function 


2T Jo 


p) ^targ) dt, 

35, and T > 0 is the period of integration. The 


where z{t^p) is one of the Lorenz state variables, Ztarg - 
design variable here is p, a parameter in the Lorenz DE. 

Eigure 1(b) | plots the Lorenz objective Jt versus the parameter p for an averaging period of T = 400. 
High-frequency oscillations can be observed in the objective function Jt- In theory, we could eliminate these 
fluctuations by integrating over an infinite time horizon, but, in practice, we must truncate the simulation 
at a finite T. The oscillations in Jt reflect the sensitivity of the Lorenz DE to changes in p, and they hint 
at the difficulties of using gradient-based optimization. Indeed, as T ^ oo the gradient of Jt diverges 
despite the fact that the objective itself converges. 

Researchers have proposed methods to compute derivatives of objectives that depend on chaotic systems, 
such as the ensemble adjoint [6]-[^ and least-squares adjoint |^. These methods share one shortcoming: 
they produce estimates of the derivatives only. Eigure [T(^ illustrates some derivatives estimated using the 
ensemble adjoint. Although the derivatives capture the general slope of the objective, errors are clearly 
visible. Such errors are incompatible with conventional gradient-based optimization. 


B. Uncertainty Propagation in High-dimensional Input Spaces 

Numerical simulations contain many sources of uncertainty. Eor example, turbulence and combustion models 
introduce errors with respect to the true physics, the operating conditions of a system are not deterministic 
(e.g. the Reynolds number is not known precisely), and perturbations to the intended design are introduced 
by the manufacturing process. The field of optimization under uncertainty (OUU), a branch of uncertainty 
quantification (UQ), seeks to account for such uncertainties during the optimization process. OUU can help 
ensure good performance over a range of parameters, and it can help reduce the probability of failure. 

Typically, at each iteration of an OUU, uncertainties in the input parameters must be propagated through 
the numerical simulation to deduce uncertainties in the outputs, e.g. the standard deviation in a force or 
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Figure 2. Uncertainty propagation can be made tractable for large numbers of (aleatory) inputs by identi¬ 
fying dominant subspaces using, e.g., the Hessian (left). However, accurately differentiating such dimension- 
reduction strategies is usually not practical, and errors are inevitable (right). 


energy. Uncertainty propagation is especially challenging when there are a large (> 100) number of uncertain 
input parameters. These situations require approximations to cope with the curse of dimensionality”. 

Several parameter-selection and model-reduction strategies have been proposed for the propagation of 
stochastic, or aleatory, uncertainties in high-dimensional input spaces 


10 


12 


epistemic uncertainties, i.e. model errors, in high-dimensional spaces have also been investigated 13 


Methods of propagating 

. If 


14 


these methods are to be used for high-dimensional OUU problems, derivative information will be needed; 
however, differentiating these propagation methods accurately is not practical. Therefore, an optimization 
algorithm must account for the discrepancy between the outputs and their derivatives. 

To illustrate why dimension-reduction strategies are difficult to differentiate, consider computing the 
expected value, E[f{x)], of the Rosenbrock function 


f{x + ^) = [1 - (x + + 100 [{y -7])-{x + ^ , 

are the design variables and G A/’(0,0.5^) are Gaussian random variables. In 
this ex ampl e, we use the dominant eigenvalue of the Hessian to define a subspace for dimension reduction. 
Figure 2(a) shows one such subspace defined at cc = (—0.5,0)^ 


where x = {x^yf^ 


Figure |2(b) plots the expected value obtained by (correctly) accounting for changes in the subspace. 
The figure also plots the expected value obtained by fixing the subspace, which reflects how derivatives 
would be computed for most dimension-reduction strategies. Fixing the subspace results in a 12% error in 
the derivative. Note that, to correctly compute the derivative in the present example, we would need to 
differentiate eigenvectors of the Hessian, which would be impractical for high-dimensional outputs based on 
simulations. 


C. Other Sources of “Imperfect” Outputs 

Time-averaging in chaotic systems and uncertainty propagation in high-dimensional spaces are two important 
examples that produce imperfect data and motivate the current work. However, inconsistencies between the 
outputs and their derivatives can also arise in other applications of DECO. 


1. Cut-cell 15 -17 and immersed-boundary methods 18 ^ are popular and effective methods for numer¬ 
ically solving DEs on complex, moving geometries. In geometry optimization, these methods produce 
discontinuities in the design space as the mesh-topology and/or stencil is updated. A similar problem 
exists when unstructured grids are regenerated during shape optimization. 


2. The differentiate-then-discretize 


21 


or 


“continuous- 

tent with the discretized output. See, for example, 23 


22 , adjoint yields derivatives that are inconsis- 


24 


3. Incomplete sensitivities have been proposed 25 26 to reduce the computational cost of computing 
the gradient. The terms that are dropped or estimated in these approaches necessarily result in 
approximate derivatives. 
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4. A multi-fidelity approach may be used iu which a higher fidelity tool produces the output but a lower 


fidelity method is used to estimate gradieuts 27 


The iuaccuracies 1 aud 2 described above are discretizatiou errors that cau be reduced through mesh re- 
fiuemeut. However, the mesh refiuemeut ueeded to sufhcieutly reduce these errors may uot be possible iu 
practice, aud iu these cases the algorithms proposed below could be helpful. 


D. Imperfect Data and the State-of-the-art 

The examples above highlight importaut emergiug applicatious — high-dimeusioual uucertaiuty propagatiou 
aud simulatiou of chaotic systems — iu which outputs of iuterest aud their derivatives exhibit some form of 
iuaccuracy. If we waut to optimize these outputs, cau we rely ou state-of-the-art optimizatiou methods? 

For low-dimeusioual desigu spaces, derivative-free optimizatiou methods could be effective for the ap- 
plicatious described above. These methods iuclude determiuistic iuterpolatiou/regressiou-based surrogate 
models 28 - 30 , the Nelder-Mead simplex method 31 , aud geuetic algorithms 1^- However, “it is usually 
uot reasouable to try aud optimize problems with more thau a few teus of variables...” with derivative-free 
methods 


30 


Derivative-based algorithms are highly scalable, makiug them ideally suited for optimizatiou of smooth 
fuuctious iu large desigu spaces. Although the challeuge of differeutiatiug the simulatiou software is a 
poteutial drawback, algorithmic-differeutiatiou 34 has made this task easier. Uufortuuately, derivative- 


based algorithms are also uot suitable for the target applicatious, because they require sufhcieut accuracy iu 
the data aud cousisteucy betweeu the outputs aud their derivatives. 

Stochastic approximatiou (SA) algorithms 


35 38 


cau be used to optimize fuuctious whose evaluatiou 

There are both theoretical 

aud practical issues with applyiug SA to the applicatious described earlier. Theoretically, the “uoise” must 


coutaius uoise. Some forms of SA also permit the use of uoisy gradieuts 35 39 


be uubiased aud iudepeudeut of the desigu space 38 , aud these assumptious are uot met by the target 


applicatious. From a practical perspective, SA algorithms teud to use mauy “low quality” iteratious to eusure 
(probabilistic) couvergeuce, so fuuctiou aud gradieut evaluatious must be iuexpeusive. This requiremeut is 
also uot fulfilled by the DE-coustraiued optimizatiou problems uuder cousideratiou. 

Reduced-order models (ROM) offer a distiuct approach from derivative-free aud derivative-based opti¬ 
mizatiou. Rather thau tackliug the DE-based optimizatiou directly, ROM methods first seek a simplified, aud 
presumably less expeusive, model for the DE simulatiou. Subsequeutly, the optimizatiou is performed usiug 
the ROM as a surrogate for the full DE model. Methods iu this class iuclude projectiou-based approaches 
like proper-orthogoual decompositiou 40 -44 . These approaches typically fix the desigu parameters while 


coustructiug the ROM, although a method that accouuts for parameter depeudeuce was receutly proposed 


by Liebermau et al. 45 iu the case of steady, liuear DEs. However, buildiug parametric ROMs for uouliuear 


aud/or uusteady DEs remaius au active aud challeugiug area of research. 


II. Arnoldi Sampling 

The primary coutributiou of this work is a sampliug procedure that is suitable for high-dimeusioual 
iuput spaces wheu the gradieut is available, but poteutially iuaccurate. The sampliug procedure is based ou 
Aruoldi’s method, which we briefly review below. 


A. Arnoldi’s Method 

Aruoldi’s method is fouud iu Krylov subspace methods for spectral aualysis aud solviug liuear systems; see, 
for example, 47 aud 48 aud the refereuces thereiu. Iu those applicatious, Aruoldi’s method is used to 
coustruct au orthogoual basis for the Krylov subspace /C^(A, z) = spaujz, Az, A^z, • • • , where A is 

a square matrix aud ^ is a vector. 

A versiou of Aruoldi’s method based ou modified Gram-Schmidt is provided iu Algorithmic for refereuce. 
Au importaut feature of Aruoldi’s method is that A is uot required explicitly: ouly matrix-vector products 
of the form Azj are ueeded. We exploit this aspect of the algorithm iu the proposed sampliug procedure. 

To uuderstaud the origius of the proposed sampliug procedure, it is helpful to review how Aruoldi’s 
method is traditioually used iu the coutext of optimizatiou. Efhcieut optimizatiou algorithms for ob¬ 
jectives apply Newtou’s method, or a quasi-Newtou method, to the first-order optimality couditious. Eor 
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Algorithm 1: Arnoldi’s method. 

Data: zi G such that || 2 :i|| = 1 

Result: Z = [ 2 : 1 , 2 : 2 ,, 2 :^+ 1 ], an orthonormal basis for /C^+i(A, zi) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 


for j = 1, 2,..., m do 

^i+i = 

for i = 1,..., j do (Modified Gram-Schmidt) 

Zj-^i ^ Zj-^i — hijZj 

end 

compute hj^ij = 112:^+11| 

if ||hj+ij|| = 0 then return; (check for breakdown) 

Zj+i ^ Zj+ilhj^ij 

end 


unconstrained convex problems, Newton’s method produces systems of the form 


\Np = -g, 

where p is a trial step, g = V/ is the gradient of the objective, and W = V^/ is the Hessian, or an 
approximation to it. When Algorithmis used in this setting, A becomes W and Arnoldi’s method reduces 
to the symmetric Lanczos algorithm that appears in the conjugate gradient method. 

Thus, when applied to optimization problems, Arnoldi’s method requires Hessian-vector products. These 
products can be computed in several ways, including algorithmic differentiation 34 and, when PDEs are 


involved, second-order adjoints 49 - 51 . The products can also be computed using a finite-difference approx¬ 


imation applied to the gradient, since 


N/]^, = lim 


g{x + ezj) - g{x) 


B. Arnold! Sampling 

Arnoldi’s method can be transformed into a sampling procedure by recognizing that the Hessian-vector 
products selected by the algorithm represent infinitesimal samples. If these infinitesimal perturbations are 
made finite, they can be used as a sampling procedure. In other words, the sample locations are defined by 


Xj = xq azj , 


where o > 0 is the sample radius, and the zj are defined by Arnoldi’s method with Azj replaced with 
[g{xo + azj) — g{xo)] /a. This Arnoldi sampling procedure is listed in Algorith m 

In addition to providing the sample locations and sampled data. Algorithm |2]also produces approxima¬ 
tions to the eigenvalues and eigenvectors of the Hessian; see lines [T^ and [T^ This approximation is based 


on iterative eigenvalue methods 47 52 


The eigen-decomposition in Arnoldi sampling uses the symmetric part of H^, the mxm upper Hessenberg 
matrix composed of the hij. In contrast, the conventional Arnoldi’s method for spectral analysis uses the 
matrix itself. We use the symmetric part, because reduces to a symmetric matrix when the gradients 
are accurate and a ^ 0. This also avoids the generation of a complex spectrum, which would not be 
appropriate in the context of a Hessian. 

When the gradients are accurate and a is chosen suitably, Arnoldi sampling reduces to a Arnoldi’s method 
with finite-difference approximations for the Hessian-vector products. Like finite-difference approximations, 
a must be chosen carefully to achieve optimal performance. Unlike finite-difference approximations, our 
proposed method benefits from a relatively large sampling radius, one that would normally cause undesirable 
truncation errors in a finite-difference approximation. 


C. Eigenvalue Accuracy Study 

We use the dominant eigenvalues produced by the Arnoldi sampling procedure to build a quadratic model for 
optimization; therefore, it is important to verify and quantify the accuracy of the approximate eigenvalues. 
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Algorithm 2: Arnoldi sampling. 


Data: m > 0, o > 0, xq, /o = /(^o) and qq = g{xQ) 

Result: sampled locations = [xq, xi, X 2 ,..., x^] 

sampled function values F^+i = [/o, /i,..., fm] 
sampled gradient values G^+i = [ 5 ^ 0 , 5 '!, • • • 
approximate eigenvalues, = [Ai, A 2 ,..., A^], and 
approximate eigenvectors of the Hessian = ['^ 1 , '^ 2 , • • •, '^m] 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 


set zi = -^o/||^o|| 

for j = 1, 2,..., m do 

set Xj = xo + azj 

sample fj = f{xj) and gj = g{xj) 

compute Zj^i = {gj - go)/a 

for i = 1,..., j do (Modified Gram-Schmidt) 

hij = zJ^^Zj 

Zj^i •(— Zj-f-i — hijZj 

end 

compute hj+ij = \\zj+i\\ 

if ||hj+ij|| = 0 then (check for breakdown) 

I set m = j and break 

end 

Zj^i ^ Zj^i/hj^ij 

end 


16 compute eigen-decomposition of the symmetric part of H^, i.e. ^ [H^ + H^] = V^A^. 

17 compute the approximate eigenvectors 


To this end, we apply Algorithmto a set of quadratic objectives with different eigenvalue distributions. In 
particular, the objective is defined by 

F{x) = (1) 

where E denotes the 2^ x 2^ orthonormalized Hadamard matrix, whose columns are the synthetic eigenvectors. 
The diagonal matrix H G holds the synthetic eigenvalues, given by 


= 


1 


where g = |, 1, or 2. 

For this investigation, we consider n = 256 variables. We run Arnoldi sampling with m = 16 iterations and 
a sample radius of a = 1. The initial point about which samples are taken is {xo)i = sin(i), i = 1,2,..., 256. 
Noise with a Gaussian distribution is added to each component of the gradient. The noise has mean zero 
and its standard deviation is either 0.5%, 2.5% or 5%, relative to the norm of the initial gradient at xq. 
Figure shows the relative error in the first 8 estimated inverse eigenvalues: 


erroq = 



/A, 


-1 


Xi/~Xi - 1 


where A^ is the estimated eigenvalue and Xi is the exact eigenvalue. Recall that the eigenvalue distributions 
are given by {l/v^}f£i, {l/ 0 i=i 5 fhe results for the three distributions are plotted left to 

right in the figure. The blue dot in the figures denotes the median value of the error over 100 runs, and the 
lower and upper bars represent the 0.025 and 0.975 probability quantiles, respectively. The magnitude of 
the noise increases from 0.5% in the top set of figures to 5% in the bottom set of figures. For reference, the 
red squares denote the error in the estimated eigenvalues when no noise is present. 

Overall, the plots suggest that the dominant eigenvalues estimated by Arnoldi sampling are resilient over 
a range of noise magnitudes. In addition, the results indicate that spectra whose energy decays quickly are 
less affected by noise; nevertheless, it is important to point out that relative errors in eigenvalues with small 
magnitude can have a pronounced impact on the optimization steps. 
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(a) 0.5% noise 
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(b) 2.5% noise 
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(c) 5% noise 
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Figure 3. Relative accuracy in the inverse of the eigenvalues approximated using Arnoldi sampling with varying 
noise magnitude. 
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(a) accurate gradient 


(b) inaccurate gradient 


Figure 4. Even when the Hessian is exact, a small error in the gradient can significantly impact the quality of 
the step. 


The results in Figure also suggest that Arnold! sampling should not be used with relative noise above 
5%; however, this conclusion pertains to the chosen radius oi a = 1 only, since an increase in a tends to 
reduce the impact of noise. On the other hand, an increase in a risks an increase in truncation error for 
strongly nonlinear functions. 


III. Stochastic Arnoldi’s Method 

In this section, we describe how Arnold! sampling can be used in an optimization framework. The key 
idea is to form a quadratic model based on the estimated eigenvalues and eigenvectors, and seek a candidate 
step p in the span of the eigenvectors: 

min q{p) = / + p + ^p^VAV^p, subject to ||p|| < A, (2) 

pGspan(V) 2 

where the diagonal matrix A G holds the r < m largest (in magnitude) eigenvalue estimates from 

Arnold! sampling, and V G are the corresponding eigenvector estimates. Note that il) includes a 

trust-radius constraint, with radius A > 0, to handle the possibility of nonconvex models. 

We have not yet specified how / and g are estimated in the trust-region subproblem <§■ While the 
function value has no influence on the solution of p, it does play a role in the step-acceptance procedure 
described later. The linear term p, on the other hand, does impact p. 

A naiVe choice would be to set g = V/(xo), i.e. the gradient evaluated at the initial poinf|^ Although 
this is a natural choice for accurate data, it can lead to inaccurate steps in the presence of errors, even if the 
Hessian model is exact; this is illustrated in Figure In the following subsections, we propose two possible 
choices for g that attempt to ameliorate errors in the gradient. 

A. Variant 1: Step Average 

We can view each of the points Xj^j = 0,1,... ,m, produced by Arnold! sampling as a potential location at 
which to center the quadratic model. For each of these locations, we can adopt the corresponding gradient, 
Pj, in the model q{p) and find a step. Let pj = Vpj denote this step. Substituting this step into Q, and 
ignoring the trust-radius constraint for the moment, the solution is 

Pj = -VA-^V^p,. 

The trial solution for point j is Xj +Pj. Averaging over all potential trial solutions, we obtain 

^ m 

= - —r + Pj) = x- VA“^V^5, 

m + 1 ^ 

J=0 

where x = Xj)/{m + 1) and p = gj )/(m -b 1). In the sequel, we will refer to this as the step-average 

approach. Note that, if the trust-radius constraint is present, we solve (§ with p defined as the average 
gradient. 

^The initial point for Arnoldi sampling is the current iterate of the outer optimization loop. 
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B. Variant 2: Directional Derivatives 

Our numerical experiments suggest that step averaging is effective when the error has mean zero; however, 
it does not address bias in the error. This is not surprising, because ^ is a simple average, so the bias will 
persist. As an aside, we remark that constant bias in the set {gj}JLQ does not impact the eigenvalue and 
eigenvector approximation in Arnoldi sampling, because these approximations are based on differences in 
the gradient. 

To address bias in the construction of we examine the form of the (unconstrained) solution when the 
step is in the subspace V and g is the exact gradient: 

p = V^, where y = A“^ (—. 

We see that the reduced-space solution, y, is the inverse of A acting on the reduced-space negative gradient. 
This is an important observation, because it means we can focus on approximating V^g rather than g. 

The reduced gradient, is the directional derivative of the objective in the estimated eigenvector 

directions, V. These directional derivatives can be approximating using the sampled function values: 


^ 9 ^ ,^red — 

’ a 


where ^red denotes the approximate reduced gradient. By using the function values to approximate the 
directional derivatives, we eliminate bias in the gradient error. 

When the trust-region constraint is present, the reduced-space solution is found by solving the trust- 
region problem with g = V^red, and adding the step, p = V^, to the point xq as defined in Arnoldi 
sampling. We will refer to this as the directional-derivative variant. 


/l - /o 

/2 - /o 

fr - fo 


C. Comparison of the Two Variants 

We use the synthetic quadratic objective, defined earlier by 0, to study and compare the two proposed 
variants for g. As before, we use n = 256 variables and set o = 1. The initial point provided to the Arnoldi 
sampling algorithm is {xo)i = = 1,..., 256. The Hessian model is constructed using the r = 4 largest 

estimated eigenvalues and eigenvectors from Arnoldi sampling, which is run with m = 16 iterations. 

We consider two models for error. The first model adds Gaussian noise with mean zero to the function, 
F, and its gradient, VF. The noise has a standard deviation equal to 2.5% of F{xq) and ||VF(xo)|| for the 
function and gradient components, respectively. The second model for the error is also Gaussian and has 
the same standard deviation, but each component of the gradient error has a mean of 0.1|| VF(xo)||, i.e. the 
gradient has a biased error. 

The two variants of g were used to solve the subproblem 0, based on F(x), for the three eigenvalue 
spectra, and For each spectra, the methods were applied 1000 times, and 

statistics for the relative change in the objective, F{xo +p)/F(xo), were gathered. Figure 5(a) shows the 
median (square symbol) for the relative change in the objective when the error has no bias. The bars show 
the 0.025 and 0.975 quantiles of probability. Figure [5(1^ plots the same results when bias is present. 

When the error has no bias, it is clear that the step-average approach outperforms the directional- 
derivative approach for this set of quadratics. Indeed, for spectra with rapid decay, the objective increases 
on average using the directional-derivative approach. It must be emphasized, again, that these results are 
sensitive to the choice of a] the directional-derivative variant relies on a finite-difference step-size that is 
accurate for both the Hessian-vector product and gradient approximations. When the objective function 
varies rapidly in some directions, as it does for = 1/i^, a good step size for both the Hessian-vector 
products and gradient is difficult to find and, in fact, may not exist. 

As expected, the results change when the gradient error has bias. In particular, we see that the directional- 
derivative approach is relatively insensitive to the addition of bias, i.e. for a given spectra, the results are 
similar. In contrast, the step-average approach performs more poorly when the spectrum decays slowly; 
however, the step-average approach continues to outperform the directional-derivative variant when the 
spectrum decays rapidly. 


9 offl^ 


American Institute of Aeronautics and Astronautics 








F/Fo 



T step-average 
T dir.-derivative 


(a) results when gradient error has no bias 


F/Fo 



T step-average 
T dir.-derivative 


(b) results when gradient error has bias 


Figure 5. Comparison of the two methods of computing g — the step-average and directional-derivative 
approaches — without bias in the gradient error (figure a) and with bias in the gradient error (figure b). 


D. Optimization Framework 

When the objective function is nonlinear, the optimization algorithm must use some form of globalization. 
To this end, we have adopted a trust-region framework [5^ for this work. We will refer to the overall 


optimization algorithm, listed in Algorithm as the Stochastic Arnoldi’s Method, or SAM for short. Note 
that there are two variants of SAM; one based on the step-average and one based on the directional-derivative 
method to compute g. 

Each iteration of SAM begins by assessing convergence using the quadratic model’s gradient norm, ||^||. 
In the case of the step-average variant, the criterion is based on the average gradient norm. The directional- 
derivative variant uses the gradient norm of ^fred- 

If the convergence criterion is not met, SAM solves the (small) trust-region subproblem (§ using the 
More and Sorensen algorithm 54 . In Algorithm this subproblem is written in terms of the reduced- 
space y] note that the step length satisfies ||p|| = ||Vy|| = ||^||, because the approximate eigenvectors are 
orthonormal. 

Subsequently, the step is used to obtain the trial solution Xnew The trial-solution formula depends 
on whether the step-average or directional-derivative variant is adopted. Once the trial solution has been 
computed, SAM follows a fairly standard trust-radius update based on the ratio, p, of the actual objective 
reduction to the model’s prediction of the reduction. One departure from conventional trust-region methods 
arises if the step is rejected: in this case the proposed method re-evaluates the objective and gradient at the 
current solution estimate. 

While trust-region algorithms are well studied for accurate and even inexacf|^ function evaluations, their 
use with imperfect data is less developed. We plan a careful study of the theoretical properties of Algorithmj^ 
in future work, but make no guarantees regarding its convergence at present. 


‘^Inexact refers to errors that can be controlled, e.g. errors due to incomplete convergence of a solver. 
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Algorithm 3: Stochastic Arnoldi’s Method. 

Data: xo, r (approximate-Hessian rank), m (Arnoldi iterations), a (sample radius), A (initial trust 
radius), and r (convergence tolerance) 

Result: approximate minimum, x 


1 

2 

3 

4 

5 

6 
7 


set X ^ xq, and compute Fi^i = fjx) and G:^i = g{x) 

use Arnoldi sampling (Algorithm^ to obtain X, F, G, V, and A 

for /c = 1, 2,... do 

if Il5ll < r then (check convergence) 

I return 
end 

Solve trust-region optimization problem in the reduced space: 


min q{y) = f + + -y'^Ay, 

y 2 


subject to ll^ll < A. 


8 compute Xnew ^ ^ (step-average) or Xnew ^ x -\-\/y (directional-derivative) 

9 compute fk = /(Xnew) and Qk = ^(^new) 

10 compute p = ared/pred = -{fk-i - fk)/q{y) 

11 if p < 0.1 then 

12 I A ^ A/4 

13 else 

14 if p > 3/4 and ||p|| < A then 

15 I A i min(2A, Aj^iax) 

16 end 

17 end 

18 if p > 10“^ then 

19 I set X ^ ^new, aud update F.^i = //. and G:^i = pk 

20 else 

21 I keep X and resample F.^i = f{x) and G:^i = g{x) 

22 end 

23 use Arnoldi sampling (Algorithmto update X. 2 :(m+i )5 G:, 2 :(m+i): and 

24 end 
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Figure 6. Comparison of the SAM variants with BFGS and Nelder-Mead on the multi-dimensional Rosenbrock 
function with Gaussian noise added. 


IV. Results 


We conclude by benchmarking SAM against two established optimization algorithms: the (derivative- 
based) BFGS 55 quasi-Newton method and the (derivative-free) Nelder-Mead algorithm [^. For this 
study, we optimize a modified multi-dimensional Rosenbrock function: 

n/2 


Fix) = ^ - [l00(a;2i - + (1 - X2i-if] , 




where n = 256. This definition differs from the standard multi-dimensional Rosenbrock function 56 with 
the introduction of the scaling factor 1/i. This factor ensures the Hessian has a decaying spectrum, which 
is an underlying assumption in Arnoldi sampling and, consequently, SAM. 

As before, we consider both unbiased and biased Gaussian noise with standard deviations of 2.5% of 
F{xo) and ||VF(xo)|| for the function and gradient components, respectively. For the biased-noise case, each 
entry in the gradient has a mean error of 0.1|| VF(xo) ||. 

We benchmark the two variants of SAM against the BFGS and Nelder-Mead implementations in Matlab®. 
The initial iterate for SAM and BFGS is given by 


ixo)i = 


— 1, if i is odd, 
0, if i is even 


For Nelder-Mead, xq forms one of the vertices of the initial simplex. The parameters used for SAM are as 
follows: r = 4, m = 16, a = 0.5, A = 10||xo||, and r = 0.1. In addition, SAM is limited to a maximum of 10 
iterations. 

Figure [^compares the median objective reduction achieved by the two SAM variants, BFGS, and Nelder- 
Mead. The lower and upper bars denote the 0.025 and 0.975 probability quantiles, respectively. The step- 
average variant of SAM reduces the objective by approximately two orders of magnitude for both the unbiased 
and biased errors. Somewhat surprisingly, the step-average method performs better than the directional- 
derivative method in the biased-error case; this may be due to the choice of model radius a. Both variants 
of SAM perform significantly better than the BFGS and Nelder-Mead simplex methods. 

Figure compares the convergence histories of SAM with those of BFGS and Nelder-Mead. Only the 
step-average variant of SAM is shown. The BFGS convergence stalls in both cases, because the errors cause 
the line search to fail. These histories, which are typical, demonstrate that the performance of SAM comes 
at the cost of additional functional evaluations. 


V. Summary and Conclusions 

Numerical optimization has helped engineers in numerous fields; however, there remain important appli¬ 
cations that cannot use conventional optimization algorithms. In this work, we have targeted applications 
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(a) Unbiased noise (b) Biased noise 

Figure 7. Sample convergence histories for the SAM (step-average variant), BFGS, and Nelder-Mead algo¬ 
rithms. 


that have large-dimensional design spaces and whose outputs are imperfect, i.e. their outputs and derivatives 
contain irreducible errors that are incompatible with most gradient-based algorithms. To enable optimization 
for these applications, we have developed a high-dimensional sampling method based on Arnoldi’s method. 

Arnoldi sampling adaptively selects new sample points and tends to capture the dominant eigenmodes of 
the (spatially averaged) Hessian. The sampling strategy could be used in conjunction with a nonparametric 
regression, e.g. Gaussian-process regression, to build surrogate models for optimization when errors are 
present in the objective and its derivatives; however, in the present work, we have focused on conventional 
quadratic models. 

We investigated two methods of constructing the linear term that appears in the quadratic model. The 
first approach involved averaging over all possible sample points and gradients. The second approach used 
directional derivatives of the objective function to estimate the gradient in the reduced space. The directional- 
derivative variant was found to be effective when biased errors were present and the Hessian’s spectrum was 
slowly decaying; however, in general, the step-average approach was more effective. 

A trust-region algorithm called the Stochastic Arnoldi’s Method (SAM) was proposed and used to mini¬ 
mize the multi-dimensional Rosenbrock function. Synthetic errors were introduced in the evaluation of the 
objective and its gradient. The performance of SAM on this problem was promising relative to derivative- 
based (BFGS) and derivative-free (Nelder-Mead) algorithms. In particular, whereas BFGS had inconsistent 
convergence and Nelder-Mead failed to converge, the Arnoldi-based algorithm consistently reduced the ob¬ 
jective by two orders of magnitude. 
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